---
output: html_notebook
---
# IT326 Project




# Description of the dataset:

This dataset, obtained from vgchartz.com, is a resource for exploring the relationship between gaming platforms and genres across more than 16,000 video games ranked by global sales. It lets us analyze which platforms drive worldwide sales, identify the best-selling genres in each region, and track how platform preference and genre popularity change over time.

# Source and link:
Source: Kaggle

URL link: https://www.kaggle.com/datasets/gregorut/videogamesales

# Our goal:

Our goal in studying this dataset is to apply classification and clustering techniques to predict the popularity of upcoming games.



# Attributes description:


| **Attribute name** | **Description** | **Data type** |
|-----------------------------|-------------------------------------|---------------------|
| Rank | Ranking of the game based on global sales. | Numeric |
| Name | Name of the game. | Nominal |
| Platform | Platform the game was released on. | Nominal |
| Year | Year the game was released. | Ordinal |
| Genre | Genre of the game. | Nominal |
| Publisher | Publisher of the game. | Nominal |
| NA_Sales | Sales of the game in North America (in millions of units). | Numeric (ratio-scaled) |
| EU_Sales | Sales of the game in Europe (in millions of units). | Numeric (ratio-scaled) |
| JP_Sales | Sales of the game in Japan (in millions of units). | Numeric (ratio-scaled) |
| Other_Sales | Sales of the game in other regions (in millions of units). | Numeric (ratio-scaled) |
| Global_Sales | Total sales of the game worldwide (in millions of units). | Numeric (ratio-scaled) |


# Class label:

`Popular` is our class label. We derive it from the Global_Sales attribute: a game is popular if it sells 1,000,000 copies or more globally. Because the label is a yes/no category, our data mining task is classification.
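
A minimal sketch of how such a label could be derived, assuming Global_Sales is recorded in millions of units (so the 1,000,000 threshold corresponds to a value of 1). The data frame `toy_sales` and its values are made up for illustration and are not part of the project code.

```r
# Hypothetical toy data; Global_Sales is assumed to be in millions of units
toy_sales <- data.frame(
  Name = c("Game A", "Game B", "Game C"),
  Global_Sales = c(82.74, 0.47, 1.00)
)

# A game is labelled popular when it sold 1,000,000 copies or more
toy_sales$Popular <- factor(
  ifelse(toy_sales$Global_Sales >= 1, "yes", "no"),
  levels = c("no", "yes")
)

table(toy_sales$Popular)
```

Storing the label as a factor keeps it compatible with classifiers that expect a categorical outcome.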









Importing our dataset:
```{r}
dataset=read.csv("vgsales.csv")
```


Loading the libraries needed for our data mining tasks:
```{r}
library(outliers) 
library(dplyr)
library(Hmisc)
library(ggplot2)
library(cowplot)
library(mlbench)
library(caret)
library(faux)
library(DataExplorer)
library(randomForest)
options(max.print=9999999)
```




General information about our dataset: number of rows and columns, dimensions, and column names:
```{r}
nrow(dataset)
ncol(dataset)
dim(dataset)
names(dataset)
```




Dataset structure, including the number of rows and columns and the attribute types:
```{r}
str(dataset)
```



Sample of the raw dataset (first 10 rows):
```{r}
head(dataset, 10)
```

Sample of the raw dataset (last 10 rows):
```{r}
tail(dataset, 10)
```

Summary of our dataset:
```{r}
summary(dataset)
```

Variance of the numeric sales attributes:
```{r}
var(dataset$NA_Sales)
var(dataset$EU_Sales)
var(dataset$JP_Sales)
var(dataset$Other_Sales)
var(dataset$Global_Sales)
```
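
The five calls above can also be written as a single pass over the numeric columns. The sketch below shows the idea on a made-up data frame (`toy_df` is illustrative, not the project dataset):

```r
# Made-up data with one non-numeric column
toy_df <- data.frame(
  Name = c("Game A", "Game B", "Game C"),
  NA_Sales = c(1, 2, 3),
  EU_Sales = c(2, 4, 6)
)

# Pick the numeric columns, then compute the variance of each
numeric_cols <- sapply(toy_df, is.numeric)
variances <- sapply(toy_df[numeric_cols], var)
variances
```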






# Graphs:

```{r}
dataset2 <- dataset %>% sample_n(50)
tab <- dataset2$Platform %>% table()
precentages <- tab %>% prop.table() %>% round(3) * 100 
txt <- paste0(names(tab), '\n', precentages, '%') 

pie(tab, labels=txt , main = "Pie chart of Platform") 

```

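The pie chart, drawn from a random sample of 50 games, shows that PlayStation platforms account for the largest share of titles. This suggests that releasing a game for PS users may improve its reach, although the chart shows how many titles each platform has rather than how well they sell.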
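
Because `sample_n(50)` draws a different random subset on every run, the pie chart and its percentages change between renders. The sketch below, on a made-up platform vector and using base R's `sample()`, shows how fixing the seed makes the draw repeatable. The seed value and the platform counts are arbitrary assumptions.

```r
# Made-up platform labels standing in for dataset$Platform
platforms <- rep(c("PS2", "DS", "Wii", "X360"), times = c(40, 30, 20, 10))

set.seed(326)  # arbitrary seed; any fixed value works
first_draw <- sample(platforms, 50)
set.seed(326)  # resetting the seed reproduces the same draw
second_draw <- sample(platforms, 50)

identical(first_draw, second_draw)
round(prop.table(table(first_draw)) * 100, 1)
```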





```{r}
# coloring barplot and adding text
tab<-dataset$Genre %>% table() 

precentages<-tab %>% prop.table() %>% round(3)*100 

txt<-paste0(names(tab), '\n',precentages,'%') 

bb <- dataset$Genre %>% table() %>% barplot(axisnames=F, main = "Barplot for Popular genres ",ylab='count',col=c('pink','blue','lightblue','green','lightgreen','red','orange','red','grey','yellow','azure','olivedrab')) 

text(bb,tab/2,labels=txt,cex=1.5) 
```
In terms of genre, Action has the most titles, followed by Sports and Misc. A plausible explanation is that publishers release more games in genres that sell well, although the barplot itself shows counts of titles, not sales.





```{r}
boxplot(dataset$NA_Sales, main = "BoxPlot for NA_Sales")
```

```{r}
boxplot(dataset$EU_Sales, main = "BoxPlot for EU_Sales")
```

```{r}
boxplot(dataset$JP_Sales, main = "BoxPlot for JP_Sales")
```




```{r}
boxplot(dataset$Other_Sales, main = "BoxPlot for Other_Sales")
```  

The boxplot of the Other_Sales attribute indicates that most values lie close together near zero, while a long tail of outliers reaches far above them. This is expected for sales data, where a few best-selling games account for very large figures.




```{r}
boxplot(dataset$Global_Sales , main="BoxPlot for Global_Sales")

```  
The boxplot of the Global_Sales attribute shows the same pattern: most values lie close together near zero, and there are many outliers, because a few best-selling games have far higher global sales than the rest.




```{r}
ggplot(dataset, aes(x = Global_Sales, y = Genre)) +
  geom_boxplot(fill = "yellow", width = 0.5) + ggtitle("BoxPlots for genre and Global_Sales")  # replaces qplot(), deprecated in ggplot2 3.4.0
```

```{r}
dataset$Year %>% table() %>% barplot( main = "Barplot for year")
```

```{r}
pairs(~NA_Sales + EU_Sales + JP_Sales + Other_Sales + Global_Sales, data = dataset,
      main = "Sales Scatterplot")
```    
We used a scatterplot matrix to determine the type of correlation between the sales attributes; most pairs are positively correlated.
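
The visual impression from the scatterplot matrix can be backed by numbers with `cor()`. The sketch below uses a made-up data frame (`sales_toy`) in which the global figure is the sum of two regions, so the values are illustrative only.

```r
# Made-up regional sales; Global_Sales is built as their sum
sales_toy <- data.frame(
  NA_Sales = c(1, 2, 3, 4, 5),
  EU_Sales = c(0.5, 1.1, 1.4, 2.2, 2.4)
)
sales_toy$Global_Sales <- sales_toy$NA_Sales + sales_toy$EU_Sales

# Pearson correlation between every pair of columns
cor_matrix <- cor(sales_toy)
round(cor_matrix, 2)
```

Values close to 1 indicate a strong positive linear relationship; values near 0 indicate little linear relationship.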
 
 
      
# Pre-processing

# Variable transformation
```{r}
dataset$Rank=as.character(dataset$Rank)
```
We converted Rank from numeric to character because we treat it as ordinal data rather than as a quantity.
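
One caveat worth illustrating: character values compare alphabetically, so `as.character()` alone does not preserve the ordering of ranks. If the ordering matters later, an ordered factor is one option. A minimal sketch on made-up ranks (the names `ranks`, `rank_chr`, and `rank_ord` are illustrative only):

```r
ranks <- c(3, 1, 2)  # made-up ranks

# Character conversion keeps the labels but not the numeric order
rank_chr <- as.character(ranks)
"10" < "9"  # string comparison is alphabetical

# An ordered factor keeps the labels and the order
rank_ord <- factor(ranks, levels = sort(unique(ranks)), ordered = TRUE)
is.ordered(rank_ord)
rank_ord[1] > rank_ord[2]  # compares rank 3 with rank 1
```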

# Null checking
```{r}
sum(is.na(dataset$Rank))
NullRank<-dataset[dataset$Rank=="N/A",]
NullRank
```
Checking for nulls in Rank (there are no nulls).
```{r}
sum(is.na(dataset$Name))
NullName<-dataset[dataset$Name=="N/A",]
NullName
```

Checking for nulls in Name (there are no nulls).

```{r}
sum(is.na(dataset$Platform))
NullPlatform<-dataset[dataset$Platform=="N/A",]


```
Checking for nulls in Platform (there are no nulls).

```{r}
sum(is.na(dataset$Year))
NullYear<-dataset[dataset$Year=="N/A",]
NullYear
```
Checking for nulls in Year. `is.na()` finds none, but some rows contain the placeholder string "N/A".
We will not delete these rows; we keep "N/A" as a global constant because we still want the sales data from them.
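
The two checks disagree because the file stores missing years as the literal text `N/A`, which `is.na()` does not treat as missing. A small sketch, using a made-up inline CSV rather than `vgsales.csv`, shows the difference and how the `na.strings` argument of `read.csv` would turn such text into real `NA` values:

```r
# Made-up CSV text standing in for the real file
csv_text <- "Name,Year\nGame A,2006\nGame B,N/A\nGame C,1996"

# Default import: "N/A" stays a string, so Year is not read as a number
raw_df <- read.csv(text = csv_text)
sum(is.na(raw_df$Year))      # counts real NA values only
sum(raw_df$Year == "N/A")    # counts the placeholder strings

# Declaring the placeholder converts it to NA and keeps Year numeric
clean_df <- read.csv(text = csv_text, na.strings = "N/A")
sum(is.na(clean_df$Year))
class(clean_df$Year)
```

Whether to convert the placeholder is a design choice. Keeping it as a constant, as done here, preserves every row but leaves Year as text.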


```{r}
sum(is.na(dataset$Other_Sales))
NullOther_Sales<-dataset[dataset$Other_Sales=="N/A",]


```
There are no null values in Other_Sales.

```{r}
sum(is.na(dataset$Genre))
NullGenre<-dataset[dataset$Genre=="N/A",]
NullGenre
```
Checking for nulls in Genre (there are no nulls).
```{r}
sum(is.na(dataset$Publisher))
NullPublisher<-dataset[dataset$Publisher=="N/A",]
NullPublisher
```
Checking for nulls in Publisher.
We will not delete the rows with a missing publisher; we keep the "N/A" placeholder as a global constant because we still want their sales data.


```{r}
sum(is.na(dataset$Global_Sales))
NullGlobal_Sales<-dataset[dataset$Global_Sales=="N/A",]


```
There are no null values in Global_Sales.

# Encoding
```{r}
dataset$Platform=factor(dataset$Platform,levels=c("2600","3DO","3DS","DC","DS","GB","GBA","GC","GEN","GG","N64","NES","NG","PC","PCFX","PS","PS2","PS3","PS4","PSP","PSV","SAT","SCD","SNES","TG16","Wii","WiiU","WS","X360","XB","XOne"), labels=c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31))
```
Since most machine learning algorithms work with numbers rather than text or categorical values, we encode the Platform column as integer labels (1 to 31) to facilitate our data mining task.

```{r}
dataset$Genre=factor(dataset$Genre,levels=c("Action","Adventure","Fighting","Platform","Puzzle","Racing","Role-Playing","Shooter","Simulation","Sports","Strategy","Misc"),labels=c(1,2,3,4,5,6,7,8,9,10,11,12))
```
For the same reason, we encode the Genre column as integer labels (1 to 12).
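
As a rough illustration of the encoding step, the sketch below applies the same `factor(..., levels, labels)` pattern to a made-up vector. The result is still a factor whose labels are the text "1", "2", and so on; `as.integer()` is one way to obtain plain integer codes if a model needs them. The names `genres`, `genre_levels`, `encoded`, and `codes` are illustrative only.

```r
genres <- c("Sports", "Action", "Misc", "Action")  # made-up values
genre_levels <- c("Action", "Misc", "Sports")

# Map each level to a numeric label, as in the chunks above
encoded <- factor(genres, levels = genre_levels, labels = seq_along(genre_levels))
class(encoded)
levels(encoded)

# Extract the underlying integer codes
codes <- as.integer(encoded)
codes
```

Any value missing from `levels` becomes `NA`, so the level list has to cover every category in the column.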

# Outliers
Outlier of NA_Sales. `outlier()` from the outliers package flags the value that lies farthest from the column mean:
```{r}
OutNA_Sales = outlier(dataset$NA_Sales, logical =TRUE)
sum(OutNA_Sales)
Find_outlier = which(OutNA_Sales ==TRUE, arr.ind = TRUE)
OutNA_Sales
Find_outlier
```
Outlier of EU_Sales:
```{r}
OutEU_Sales = outlier(dataset$EU_Sales, logical =TRUE)
sum(OutEU_Sales)
Find_outlier = which(OutEU_Sales ==TRUE, arr.ind = TRUE)
OutEU_Sales
Find_outlier
```
Outlier of JP_Sales:
```{r}
OutJP_Sales = outlier(dataset$JP_Sales, logical =TRUE)
sum(OutJP_Sales)
Find_outlier = which(OutJP_Sales ==TRUE, arr.ind = TRUE)
OutJP_Sales
Find_outlier
```

Outlier of Other_Sales:
```{r}
OutOS=outlier(dataset$Other_Sales, logical=TRUE)  
sum(OutOS)  
Find_outlier=which(OutOS==TRUE, arr.ind=TRUE)  
OutOS 
Find_outlier 

```


Outlier of Global_Sales:

```{r}
OutGS=outlier(dataset$Global_Sales, logical=TRUE)  
sum(OutGS)  
Find_outlier=which(OutGS==TRUE, arr.ind=TRUE)  
OutGS 
Find_outlier 

```



# Remove outliers 
```{r}
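# Drop every row flagged in any sales column (Find_outlier alone holds only the Global_Sales result)
dataset <- dataset[!(OutNA_Sales | OutEU_Sales | OutJP_Sales | OutOS | OutGS), ]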
```
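
`outlier()` marks only the single most extreme value in each column. If the aim is to flag every point that a boxplot draws beyond its whiskers, the 1.5 × IQR rule is one common alternative. The sketch below uses base R's `boxplot.stats()` on a made-up vector. It illustrates that rule under these assumptions and is not the procedure used in this project.

```r
sales <- c(0.01, 0.06, 0.17, 0.47, 0.52, 10.57, 82.74)  # made-up values

# boxplot.stats() returns the points beyond the whiskers in $out
flagged_values <- boxplot.stats(sales)$out
is_outlier <- sales %in% flagged_values

sales[is_outlier]
sales_trimmed <- sales[!is_outlier]
length(sales_trimmed)
```

On heavily skewed sales data this rule can flag a large share of rows, so the threshold is worth checking before removing anything.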



# Normalization
Dataset before normalization:
```{r}
datsetWithoutNormalization<-dataset
```


```{r}
normalize <- function(x) {return ((x - min(x)) / (max(x) - min(x)))}
dataset$NA_Sales<-normalize(datsetWithoutNormalization$NA_Sales)
dataset$EU_Sales<-normalize(datsetWithoutNormalization$EU_Sales)
dataset$JP_Sales<-normalize(datsetWithoutNormalization$JP_Sales)
dataset$Other_Sales<-normalize(datsetWithoutNormalization$Other_Sales)
dataset$Global_Sales<-normalize(datsetWithoutNormalization$Global_Sales)
```
We chose min-max normalization over z-score normalization because min-max maps every attribute onto the same fixed range, [0, 1], which makes the attributes easier to visualize and compare. It also simplifies assessing attribute importance and each attribute's contribution to the model.
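
To make the comparison concrete, the sketch below applies both transformations to a made-up vector (`toy_x`). Min-max output always falls between 0 and 1, while z-scores are centred on 0 with unit standard deviation and no fixed bounds.

```r
toy_x <- c(0, 2, 4, 10)  # made-up sales values

# Min-max: rescale so the smallest value is 0 and the largest is 1
min_max <- (toy_x - min(toy_x)) / (max(toy_x) - min(toy_x))

# Z-score: subtract the mean and divide by the standard deviation
z_score <- (toy_x - mean(toy_x)) / sd(toy_x)

min_max
range(min_max)
round(z_score, 3)
```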





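# Feature selection
Our class label (Popular) is derived from Global_Sales. The regional sales columns are scored by their importance to Global_Sales, and the least important ones are removed from the dataset.
We use caret's `filterVarImp` to compute the scores. Because Global_Sales is numeric, it scores each predictor with a regression-based measure rather than the area under the ROC curve, which it uses only for categorical outcomes.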
```{r}
roc_imp <- filterVarImp(x = dataset[,7:10], y = dataset$Global_Sales)
```
Sort the scores in decreasing order:
```{r}
roc_imp <- data.frame(cbind(variable = rownames(roc_imp), score = roc_imp[,1]))
roc_imp$score <- as.double(roc_imp$score)
roc_imp[order(roc_imp$score,decreasing = TRUE),]
```
We remove JP_Sales because it has the lowest importance to Global_Sales, the attribute our class label is based on.
```{r}
dataset<- dataset[,-9]
```
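
For a numeric outcome such as Global_Sales, a filter-style importance score can be approximated by fitting one linear model per predictor and taking the absolute t-statistic of the slope. The sketch below does this in base R on made-up data. The helper `slope_t` and the data frame `toy` are hypothetical, and the sketch illustrates the idea rather than replacing `filterVarImp`.

```r
set.seed(1)  # make the made-up data reproducible
n <- 200
toy <- data.frame(NA_Sales = runif(n), JP_Sales = runif(n))
# Outcome driven mostly by NA_Sales and only weakly by JP_Sales
toy$Global_Sales <- 2 * toy$NA_Sales + 0.1 * toy$JP_Sales + rnorm(n, sd = 0.1)

# Absolute t-statistic of the slope in a one-predictor linear model
slope_t <- function(predictor, outcome) {
  fit <- lm(outcome ~ predictor)
  abs(summary(fit)$coefficients["predictor", "t value"])
}

scores <- sapply(toy[c("NA_Sales", "JP_Sales")], slope_t, outcome = toy$Global_Sales)
sort(scores, decreasing = TRUE)
```

A predictor with a low score adds little linear information about the outcome, which is the kind of evidence used above to drop JP_Sales.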

# Dataset after pre-processing
```{r}
print(dataset)
```

